This procedure uses the DESeq2 package to analyze RNA-seq read count data, or similar integer matrix, with a multi-factor design. Therefore, it allows for testing differential expression between two conditions while there are confounding variables. It takes an integer matrix of gene-level read counts, normalizes it across samples, compare their gene expression profiles, and run DESeq2 to test differential expression.
Genetic Modifiers in Trisomy 21 Leukemogenesis.
Number of unique RNA-seq read pairs mapped to known human genes, using data from re-run of the same samples.
The read count matrix includes 18 samples and 20410 genes after removing genes with total read counts few than . The global average of read count per gene is 3840.8 and the overall percentage of non-zero read counts is 93.88%.
Table 1. Summary statistics of read counts per sample: quantiles of read count per gene, average read count per gene, and percentage of non-zero read counts.
| Ploid | Clone | Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. | Total | Positive% | |
|---|---|---|---|---|---|---|---|---|---|---|
| S41 | Trisomy21 | TMD100 | 0 | 132.00 | 2396.5 | 13514.96 | 11403.75 | 4580700 | 275840330 | 94.81 |
| S42 | Trisomy21 | TMD100 | 0 | 31.00 | 581.5 | 3443.78 | 2881.75 | 1063841 | 70287530 | 95.50 |
| S43 | Trisomy21 | TMD100 | 0 | 26.00 | 472.0 | 2900.79 | 2395.00 | 995817 | 59205187 | 94.48 |
| S50 | Trisomy21 | TMD145 | 0 | 23.00 | 439.0 | 2692.84 | 2218.50 | 729128 | 54960958 | 94.72 |
| S51 | Trisomy21 | TMD145 | 0 | 29.00 | 520.5 | 3171.65 | 2694.00 | 845270 | 64733306 | 95.67 |
| S52 | Trisomy21 | TMD145 | 0 | 16.00 | 282.0 | 1615.42 | 1398.00 | 613385 | 32970685 | 93.28 |
| S53 | Trisomy21 | TMD145 | 0 | 30.00 | 516.0 | 2808.84 | 2461.00 | 855515 | 57328377 | 94.18 |
| S54 | Trisomy21 | TMD145 | 0 | 16.00 | 313.0 | 1820.08 | 1535.00 | 471903 | 37147874 | 93.34 |
| S55 | Trisomy21 | TMD145 | 0 | 17.25 | 332.0 | 1944.37 | 1645.00 | 529318 | 39684685 | 92.69 |
| S38 | Euploid | TMD100 | 0 | 12.00 | 235.0 | 1265.51 | 1084.00 | 362660 | 25829059 | 92.87 |
| S39 | Euploid | TMD100 | 0 | 172.00 | 3189.0 | 19446.47 | 16531.50 | 5338934 | 396902392 | 94.13 |
| S40 | Euploid | TMD100 | 0 | 28.00 | 530.5 | 3090.19 | 2617.00 | 910925 | 63070763 | 94.56 |
| S44 | Euploid | TMD145 | 0 | 21.00 | 403.0 | 2523.69 | 2136.00 | 654094 | 51508424 | 93.05 |
| S45 | Euploid | TMD145 | 0 | 15.00 | 277.0 | 1492.69 | 1297.00 | 372070 | 30465716 | 93.61 |
| S46 | Euploid | TMD145 | 0 | 22.00 | 409.0 | 2312.95 | 2009.00 | 568334 | 47207280 | 95.25 |
| S47 | Euploid | TMD145 | 0 | 11.00 | 229.0 | 1467.83 | 1210.75 | 399776 | 29958415 | 92.54 |
| S48 | Euploid | TMD145 | 0 | 21.00 | 380.0 | 2504.53 | 2097.00 | 584805 | 51117526 | 94.75 |
| S49 | Euploid | TMD145 | 0 | 7.00 | 161.0 | 1117.76 | 940.00 | 353517 | 22813438 | 90.49 |
DESeq2 uses the median ratio to estimate a size (normalization) effect for each sample in an RNA-seq data set. Size effects are positively correlated to total read counts of samples. Therefore, a simple normalization method is to divide all read counts of each sample by its size factor.
The size factors of all samples range between 0.394 and 7.343 (geometric mean = 1.004). In general, we expect a positive correlation between total read count of a sample and its size factor, and their geometric mean is close to 1.0.
Figure 1. Left: correlation between median read count per gene and size factor across all samples. Right: the distribution of size factors by groups. Average size effects of the 2 groups are 1.42 vs. 1.47 (p=0.9499).
After dividing the original read counts of each sample by its size factor, normalized read counts are log2-transformed (log2(c+1)). Data before and after normalization is compared in terms of their correlation, distribution, and variance across samples.
Figure 2. Gene-level read count of two samples with the smallest (S49) and the largest (S39) size factors.
Figure 3. Distribution of read counts before and after normalization with size factors. Read counts were log2-transformed.
Figure 4. Distribution of averaged gene-level read counts after normalization by size factors. The averages often have a bi-modal (two-peak) distribution. The narrow left peak corresponds to inactive genes with lower read counts and the wider right peak corresponds to active genes expressed at different levels.
Regularized log (Rlog) transformation is another funtion implemented by the DESeq2 package to transform the original data to reduce systemic bias between RNA-seq libraries. It does 3 things:
The Rlog-transformed data has more “linear” distribution and will be used for some of the following steps, such as hierarchical clustering and PCA.
Figure 5. Data distribution after Rlog-transformation: average of all samples (left) and individual samples (right).
Figure 6. MA-plot shows the relationship between average read count and variance across samples. In the original data, genes with lower read counts have more noise and then higher variance across samples. The default DESeq2 normalization rescales the data with library size factors and the dependence remains (left). The Rlog transfromation applies a variance stablization step to penalize genes with lower read counts, so the dependence is gone (right).
Analysis of samples by comparing their global gene expression profiles, potentially to identify outlier samples and confounding variables.
Figure 7. Unsupervised hierarchical clustering of samples based on their pairwise correlation, using Rlog-transformed data of all genes. The lowest common node of two samples on the clustering tree indicate their similarity (lower = more similar). Colors in the heatmap correspond to correlation coefficient of a sample pair.
Figure 8. Principal Components Analysis (PCA) is also an unsupervised analysis that converts a large number of correlated variables (genes) into a smaller set of uncorrelated variables called principal components (PCs). Each principal component accounts for certain percentage of total variance in a data set and the PCs are ordered by their percentages. This figure plots the top two PCs. Generally speaking, samples closer to each other on the 2-D space have more similar profiles of gene expression.
The analysis of differential gene expression reports a statistical table with 6 columns:
Figure 9. Visualizations that summarize the differential expression of all 20410 genes.
Table 2. Number of top DEGs selected via different FDR cutoffs. FDRs are calculated using the Benjamini & Hochberg method (Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society Series B 57, 289–300. 1995). A FDR of 0.15 to 0.25 (15% to 25% false positives) is acceptable for most follow-up analyses, while validation of differential expression via other technologies, such as qPCR, is highly recommended.
| FDR | Higher_in_Euploid | Lower_in_Euploid | Total |
|---|---|---|---|
| 0.01 | 88 | 130 | 218 |
| 0.02 | 114 | 174 | 288 |
| 0.05 | 189 | 359 | 548 |
| 0.10 | 317 | 595 | 912 |
| 0.15 | 450 | 817 | 1267 |
| 0.20 | 587 | 1000 | 1587 |
| 0.25 | 779 | 1256 | 2035 |
Figure 10. The top-ranked gene with the most up- (left) and down- (right) regulated expression in Euploid. Plotting data was normalized by Rlog, and then adjusted for Clone.
Figure 11. Heatmap of the top 100 and 100 DEGs with the most up- (red) and down- (yellow) regulated expression in Euploid. Each row corresponds a DEG and the colors represent its replative expression levels across samples. Samples are clustered by the genes and the columns are colored according to sample groups (blue=Trisomy21 and red=Euploid).
Differential gene expression:
Principal components analysis
Check out the RoCA home page for more information.
The terms to represent differential expression can be used quite confusingly. In this report, fold change refers the ratio of two group means in their unlogged form. So a fold change of 2.0 means the average of the second group is increased to twice of the average of the first group; similarly, a fold change of 0.5 means the average is reduced to half. Log2(fold change) equals to the log2-transformation of the fold change. The table below gives a few examples of the conversion of these 2 variables. Log2(fold change) is more suitable for statistical analysis since it is symmetric around 0.
Supplemental Table 1. Fold Change vs. Log(Fold Change) vs. Percentage Change
| Fold change | Log2(fold change) | Percentage change (%) |
|---|---|---|
| 0.125 | -3.000 | -87.500 |
| 0.250 | -2.000 | -75.000 |
| 0.500 | -1.000 | -50.000 |
| 0.667 | -0.585 | -33.333 |
| 0.800 | -0.322 | -20.000 |
| 1.000 | 0.000 | 0.000 |
| 1.250 | 0.322 | 25.000 |
| 1.500 | 0.585 | 50.000 |
| 2.000 | 1.000 | 100.000 |
| 4.000 | 2.000 | 300.000 |
| 8.000 | 3.000 | 700.000 |
To reproduce this report:
Find the data analysis template you want to use and an example of its pairing YAML file here and download the YAML example to your working directory
To generate a new report using your own input data and parameter, edit the following items in the YAML file:
Run the code below within R Console or RStudio, preferablly with a new R session:
if (!require(devtools)) { install.packages('devtools'); require(devtools); }
if (!require(RCurl)) { install.packages('RCurl'); require(RCurl); }
if (!require(RoCA)) { install_github('zhezhangsh/RoCAR'); require(RoCA); }
CreateReport(filename.yaml); # filename.yaml is the YAML file you just downloaded and edited for your analysis
If there is no complaint, go to the output folder and open the index.html file to view report.
## R version 3.5.1 (2018-07-02)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS High Sierra 10.13.6
##
## Matrix products: default
## BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] parallel stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] xlsx_0.6.1 edgeR_3.24.1
## [3] limma_3.38.3 sva_3.30.1
## [5] genefilter_1.64.0 mgcv_1.8-26
## [7] nlme_3.1-137 DEGandMore_0.0.0.9000
## [9] snow_0.4-3 DESeq2_1.22.1
## [11] SummarizedExperiment_1.12.0 DelayedArray_0.8.0
## [13] BiocParallel_1.16.2 matrixStats_0.54.0
## [15] Biobase_2.42.0 GenomicRanges_1.34.0
## [17] GenomeInfoDb_1.18.1 IRanges_2.16.0
## [19] S4Vectors_0.20.1 BiocGenerics_0.28.0
## [21] awsomics_0.0.0.9000 rchive_0.0.0.9000
## [23] colorspace_1.4-1 gplots_3.0.1
## [25] MASS_7.3-51.1 htmlwidgets_1.5.1
## [27] DT_0.15 yaml_2.2.1
## [29] kableExtra_0.9.0 knitr_1.20
## [31] rmarkdown_1.10 RoCA_0.0.0.9000
## [33] RCurl_1.95-4.11 bitops_1.0-6
## [35] devtools_2.3.1 usethis_1.6.1
##
## loaded via a namespace (and not attached):
## [1] ellipsis_0.3.0 rprojroot_1.3-2 htmlTable_1.12
## [4] XVector_0.22.0 base64enc_0.1-3 fs_1.3.2
## [7] rstudioapi_0.11 remotes_2.2.0 bit64_0.9-7
## [10] AnnotationDbi_1.44.0 fansi_0.4.1 xml2_1.2.0
## [13] splines_3.5.1 geneplotter_1.60.0 pkgload_1.0.2
## [16] jsonlite_1.6.1 Formula_1.2-3 rJava_0.9-11
## [19] annotate_1.60.0 cluster_2.0.7-1 readr_1.3.1
## [22] compiler_3.5.1 httr_1.4.2 backports_1.1.5
## [25] assertthat_0.2.1 Matrix_1.2-15 cli_2.0.2
## [28] acepack_1.4.1 htmltools_0.4.0 prettyunits_1.1.1
## [31] tools_3.5.1 gtable_0.3.0 glue_1.3.2
## [34] GenomeInfoDbData_1.2.0 dplyr_0.8.5 Rcpp_1.0.5
## [37] vctrs_0.2.4 gdata_2.18.0 crosstalk_1.1.0.1
## [40] stringr_1.3.1 xlsxjars_0.6.1 ps_1.3.2
## [43] testthat_2.3.2 rvest_0.3.2 lifecycle_0.2.0
## [46] gtools_3.8.1 XML_3.98-1.16 zlibbioc_1.28.0
## [49] scales_1.1.1 hms_0.4.2 RColorBrewer_1.1-2
## [52] memoise_1.1.0 gridExtra_2.3 ggplot2_3.3.2
## [55] rpart_4.1-13 latticeExtra_0.6-28 stringi_1.2.4
## [58] RSQLite_2.1.1 highr_0.7 desc_1.2.0
## [61] checkmate_1.8.5 caTools_1.17.1.1 pkgbuild_1.1.0
## [64] rlang_0.4.5 pkgconfig_2.0.3 evaluate_0.14
## [67] lattice_0.20-38 purrr_0.3.3 bit_1.1-14
## [70] processx_3.4.2 tidyselect_1.0.0 magrittr_1.5
## [73] R6_2.4.1 Hmisc_4.1-1 DBI_1.0.0
## [76] pillar_1.4.6 foreign_0.8-71 withr_2.2.0
## [79] survival_2.43-3 nnet_7.3-12 tibble_2.1.3
## [82] crayon_1.3.4 KernSmooth_2.23-15 locfit_1.5-9.1
## [85] grid_3.5.1 data.table_1.12.8 blob_1.2.1
## [88] callr_3.4.3 digest_0.6.25 xtable_1.8-3
## [91] munsell_0.5.0 viridisLite_0.3.0 sessioninfo_1.1.1
END OF DOCUMENT